7 research outputs found

Spark-based Cloud Data Analytics using Multi-Objective Optimization

Author: Diao Yanlei
Fan Qi
Lyu Chenghao
Shenoy Prashant
Sinha Arnab
Song Fei
Zaouk Khaled
Publication venue: HAL CCSD
Publication date: 01/01/2021

Data analytics in the cloud has become an integral part of enterprise businesses. Big data analytics systems, however, still lack the ability to take user performance goals and budgetary constraints for a task, collectively referred to as task objectives, and automatically configure an analytic job to achieve these objectives. This paper presents a data analytics optimizer that can automatically determine a cluster configuration with a suitable number of cores as well as other system parameters that best meet the task objectives. At the core of our work is a principled multi-objective optimization (MOO) approach that computes a Pareto optimal set of job configurations to reveal tradeoffs between different user objectives, recommends a new job configuration that best explores such tradeoffs, and employs novel optimizations to enable such recommendations within a few seconds. We present efficient incremental algorithms based on the notion of a Progressive Frontier for realizing our MOO approach and implement them in a Spark-based prototype. Detailed experiments using benchmark workloads show that our MOO techniques provide a 2-50x speedup over existing MOO methods, while offering good coverage of the Pareto frontier. When compared to OtterTune, a state-of-the-art performance tuning system, our approach recommends configurations that yield a 26%-49% reduction in the running time of the TPCx-BB benchmark while adapting to different application preferences on multiple objectives.

INRIA a CCSD electronic archive server

HAL Descartes

HAL-Polytechnique

Performance Modeling and Multi-Objective Optimization for Data Analytics in the Cloud

Author: Zaouk Khaled
Publication venue: HAL CCSD
Publication date: 20/09/2017

INRIA a CCSD electronic archive server

HAL-Polytechnique

Modélisation à Base de Réseaux de Neurones des Performances des Plateformes Cloud

Author: Zaouk Khaled
Publication venue: HAL CCSD
Publication date: 11/03/2021

Cloud data analytics has become an integral part of enterprise business operations for data-driven insight discovery. Performance modeling of cloud data analytics is crucial for performance tuning and other critical operations in the cloud. Traditional modeling techniques fail to adapt to the high degree of diversity in workloads and system behaviors in this domain. In this thesis, we bring recent Deep Learning techniques to bear on the process of automated performance modeling of cloud data analytics, with a focus on Spark data analytics as representative workloads. At the core of our work is the notion of learning workload embeddings (with a set of desired properties) to represent fundamental computational characteristics of different jobs, which enable performance prediction when used together with job configurations that control resource allocation and other system knobs. Our work provides an in-depth study of different modeling choices that suit our requirements. Results of extensive experiments reveal the strengths and limitations of different modeling methods, as well as superior performance of our best performing method over a state-of-the-art modeling tool for cloud analytics.

Data analysis using cloud resources is now ubiquitous in the activity of companies undertaking a digital transformation to better understand the large volumes of data at their disposal. Modeling the performance of the cloud platforms used in this context is necessary to guarantee good performance of distributed queries (called jobs) as well as better management of cloud resources. Traditional modeling techniques adapt neither to the diversity of these jobs nor to the different behaviors of distributed systems.
In this thesis, we propose recent Deep Learning techniques to automate this modeling task, with a particular focus on the Spark platform used for distributed computing. At the core of our research, we present the notion of learning embeddings, vectors capable of compactly describing the fundamental characteristics of different jobs. We show in this thesis how these embeddings enable better prediction of job performance under different configurations of the distributed computing system. We also present a study of different neural-network-based modeling choices that meet our needs. The results of our experiments reveal the strengths and limitations of the different modeling choices. Our experiments also reveal superior performance of a method we propose compared with the state of the art in modeling database management systems.

INRIA a CCSD electronic archive server

Modélisation à Base de Réseaux de Neurones des Performances des Plateformes Cloud

Author: Zaouk Khaled
Publication venue: HAL CCSD
Publication date: 11/03/2021

Thèses en Ligne

INRIA a CCSD electronic archive server

Theses.fr

HAL-Polytechnique

Neural-based Modeling for Performance Tuning of Spark Data Analytics

Author: Diao Yanlei
Lyu Chenghao
Song Fei
Zaouk Khaled
Publication venue: HAL CCSD
Publication date: 20/01/2021

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

HAL-Polytechnique

UDAO: A Next-Generation Unified Data Analytics Optimizer

Author: Diao Yanlei
Lyu Chenghao
Shenoy Prashant
Sinha Arnab
Song Fei
Zaouk Khaled
Publication venue: VLDB Endowment
Publication date: 01/08/2019

Big data analytics systems today still lack the ability to take user performance goals and budgetary constraints, collectively referred to as "objectives", and automatically configure an analytic job to achieve the objectives. This paper presents UDAO, a unified data analytics optimizer that can automatically determine the parameters of the runtime system, collectively called a job configuration, for general dataflow programs based on user objectives. UDAO embodies key techniques including in-situ modeling, which learns a model for each user objective in the same computing environment in which the job is run, and multi-objective optimization, which computes a Pareto optimal set of job configurations to reveal tradeoffs between different objectives. Using benchmarks developed based on industry needs, our demonstration will allow the user to explore (1) learned models to gain insights into how various parameters affect user objectives; (2) Pareto frontiers to understand interesting tradeoffs between different objectives and how a configuration recommended by the optimizer explores these tradeoffs; (3) end-to-end benefits that UDAO can provide over default configurations or those manually tuned by engineers.

INRIA a CCSD electronic archive server

HAL-Polytechnique